evidence(distill): Phase 3 F-DISTILL-SMOKE-001 PASS on gx10 GB10 by noahgift · Pull Request #1828 · paiml/aprender

noahgift · 2026-05-20T05:09:40Z

🎉 F-DISTILL-SMOKE-001 DISCHARGED

Real distillation 1.5B Qwen2.5-Coder teacher → 0.5B Qwen2.5-Coder student on Blackwell GB10 (sm_121):

initial_loss = 7.6746
final_loss   = 7.2036   ← LESS THAN initial
62 steps, 122.7s, no errors

Phase 3 of SPEC-DISTILL-001 is COMPLETE.
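
The gate quoted for F-DISTILL-SMOKE-001 is simply `final_loss < initial_loss`. As a minimal sketch, the numbers above can be checked like this; the `SmokeRun` class and its method names are hypothetical helpers for illustration, not part of aprender:

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class SmokeRun:
    """Summary numbers copied from the run report (hypothetical container)."""
    initial_loss: float
    final_loss: float
    steps: int
    wall_seconds: float

    def loss_delta(self) -> float:
        # Negative means the loss went down.
        return self.final_loss - self.initial_loss

    def seconds_per_step(self) -> float:
        return self.wall_seconds / self.steps

    def passes_smoke_gate(self) -> bool:
        # The criterion as quoted in the PR: final_loss < initial_loss.
        return self.final_loss < self.initial_loss


phase3 = SmokeRun(initial_loss=7.6746, final_loss=7.2036, steps=62, wall_seconds=122.7)
print(f"delta    = {phase3.loss_delta():+.4f}")
print(f"sec/step = {phase3.seconds_per_step():.2f}")
print(f"gate     = {'PASS' if phase3.passes_smoke_gate() else 'FAIL'}")
```

The gate only requires the loss to go down over the run, which is why a small drop of about 0.47 is enough to discharge a smoke test.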

What this proves

End-to-end on Blackwell with the full cascade:

✅ Teacher load (1.5B Qwen → 28 transformer blocks)
✅ Student load (0.5B Qwen → 24 transformer blocks)
✅ Forward pass (cuBLAS + pre-warmed PTX)
✅ KD loss computation (kd_step + DistillationLoss)
✅ Backward pass (no JIT-mid-training stream poisoning)
✅ Optimizer step (gradient accumulation + AdamW)
✅ Multi-step convergence (loss decreasing)
✅ Output checkpoint written (student-trained.apr/model.safetensors)
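
The PR names `kd_step` and `DistillationLoss` but does not show them. For readers unfamiliar with the loss being minimized, here is a sketch of the common temperature-scaled KL formulation of knowledge distillation. It is an assumption that aprender's loss resembles this; the function names and the default temperature are illustrative only.

```python
import math


def log_softmax(logits, temperature):
    """Numerically stable log-softmax of logits / temperature."""
    scaled = [x / temperature for x in logits]
    m = max(scaled)
    log_z = m + math.log(sum(math.exp(x - m) for x in scaled))
    return [x - log_z for x in scaled]


def kd_loss(student_logits, teacher_logits, temperature=2.0):
    """KL(teacher || student) on softened distributions, scaled by T^2."""
    log_p = log_softmax(teacher_logits, temperature)  # teacher (target)
    log_q = log_softmax(student_logits, temperature)  # student (prediction)
    kl = sum(math.exp(lp) * (lp - lq) for lp, lq in zip(log_p, log_q))
    return temperature ** 2 * kl


teacher = [2.0, 0.5, -1.0]
untrained_student = [0.0, 0.0, 0.0]
print("loss vs. untrained student:", kd_loss(untrained_student, teacher))
print("loss when student == teacher:", kd_loss(teacher, teacher))
```

The loss is zero when the student reproduces the teacher's distribution and positive otherwise, so a decreasing value means the student's outputs are moving toward the teacher's.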

Cascade landed

#	PR	What
1	#1804 PMAT-700-B	cuBLAS prewarm skip
2	#1808 PMAT-698e	workspace cap (2048)
3	#1809 PMAT-698f	APR magic in weights loader
4	#1810 PMAT-698g	non-LoRA backward pre-warm
5	#1813 PMAT-698h	rms_norm_gamma_reduce pre-warm
6	#1815 PMAT-698i	FWD-CACHE diagnostic logging
7	#1817 PMAT-698j	THE root cause — warm! macro key
8	#1820 PMAT-698k	cache-key alignment (rope fwd, rmsnorm eps)
9	#1823 PMAT-698m	smoke setup non-degenerate batch
10	#1824	post-mortem doc
11	#1827 PMAT-698n	rmsnorm pre-warm at 1e-6 + 1e-5
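
Several entries in the table (the `warm!` macro key, cache-key alignment, pre-warming rmsnorm at both eps values) describe the same class of failure: a kernel is pre-warmed under one cache key but looked up under another, so it gets compiled in the middle of training. The toy model below illustrates that failure class only. `KernelCache`, `rmsnorm_key`, and the hidden size of 1024 are hypothetical and do not reflect aprender's actual cache or macro.

```python
class KernelCache:
    """Toy compile cache: warm() before training, get() during training."""

    def __init__(self):
        self._compiled = {}
        self._training = False
        self.late_compiles = []  # keys that had to be compiled after training began

    def warm(self, key):
        self._compiled[key] = f"compiled<{key}>"

    def start_training(self):
        self._training = True

    def get(self, key):
        if key not in self._compiled:
            if self._training:
                # A miss here models a JIT compile in the middle of training.
                self.late_compiles.append(key)
            self._compiled[key] = f"compiled<{key}>"
        return self._compiled[key]


def rmsnorm_key(hidden_size, eps):
    # The key must be built identically at warm time and at launch time.
    return ("rmsnorm_fwd", hidden_size, f"{eps:.0e}")


def run(warm_eps, used_eps, hidden_size=1024):
    cache = KernelCache()
    for eps in warm_eps:
        cache.warm(rmsnorm_key(hidden_size, eps))
    cache.start_training()
    for eps in used_eps:
        cache.get(rmsnorm_key(hidden_size, eps))
    return cache.late_compiles


misaligned = run(warm_eps=[1e-5], used_eps=[1e-6, 1e-5])
aligned = run(warm_eps=[1e-6, 1e-5], used_eps=[1e-6, 1e-5])
print("late compiles (misaligned):", misaligned)
print("late compiles (aligned):   ", aligned)
```

Warming only one eps value leaves the other to be compiled on first use during training; warming both leaves nothing to compile late.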

Test plan

Evidence-only PR; the actual code changes already landed across the 11 PRs above. This PR captures the proof-of-success log + dispatch manifest for posterity.

🤖 Generated with Claude Code

2026-05-20: real distillation 1.5B teacher → 0.5B student on Blackwell GB10 with the full PMAT-698e..n + PMAT-700-B cascade active.

initial_loss = 7.6746
final_loss   = 7.2036   ← LESS THAN initial
62 steps, 122.7s, no errors

F-DISTILL-SMOKE-001 ("final_loss < initial_loss") discharged. Phase 3 of SPEC-DISTILL-001 is COMPLETE.

Evidence:
- evidence/distill-phase-3-real-kd/dispatch.json: dispatch manifest
- evidence/distill-phase-3-real-kd/launch-final-pass.txt: full training log

Run dir on gx10: /home/noah/runs/distill-smoke-20260520-070404/
Trained student checkpoint: student-trained.apr/model.safetensors

Cascade summary (all merged):
- #1804 PMAT-700-B (cuBLAS prewarm skip)
- #1808 PMAT-698e (workspace cap)
- #1809 PMAT-698f (APR magic in weights loader)
- #1810 PMAT-698g (non-LoRA backward pre-warm)
- #1813 PMAT-698h (rms_norm_gamma_reduce pre-warm)
- #1815 PMAT-698i (FWD-CACHE diagnostic logging)
- #1817 PMAT-698j (THE root cause: warm! macro key)
- #1820 PMAT-698k (cache-key alignment: rope fwd + rmsnorm eps)
- #1823 PMAT-698m (smoke setup: non-degenerate batch)
- #1824 (post-mortem doc)
- #1827 PMAT-698n (rmsnorm pre-warm at both 1e-6 + 1e-5 eps)

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>

…ASSES (Phase 4 ladder) (#1845)

2026-05-20 12:34 UTC: first end-to-end Phase 4 dispatch with real corpus (.bin shards via ShardBatchSource). 0.5B Qwen2.5-Coder teacher → 0.5B student on Blackwell GB10 (sm_121), 100-step trial.

initial_loss = 15.6094
final_loss   = 6.0095   ← Δ = -9.60 (-62% reduction)
124 steps, 232.4s, 1.87 sec/step

This is the first real-corpus Phase 4 dispatch. The synthetic Phase 3 victory (#1828, -0.47 over 62 steps) and the seq_len=256 Stage A smoke (#1833, -6.80) both predicted Phase 4 readiness; Stage C confirms it with strictly better convergence on real data (codeparrot Python tokenized to Qwen vocab, 10 shards / 383 MB).

What this validates:
- ShardBatchSource (PR #1836, PMAT-PHASE4-STAGE-B-1) reads .bin shards correctly and produces non-degenerate batches
- Pipeline integration (PR #1839, PMAT-PHASE4-STAGE-B-2) swaps from synthetic → real source via with_batch_source() cleanly
- Dispatch script DATASET_DIR knob (PR #1840) end-to-end through gx10
- Full Phase 4 readiness for the 50K-step Stage D run (compute-gated, requires user check-in per autonomous-mode rule)

Cascade math:
  Stage A: Δloss = -6.80 over 62 steps (synthetic, seq=256)
  Stage C: Δloss = -9.60 over 124 steps (real corpus, seq=256)

Per-step loss decrease:
  Stage A: -0.110/step
  Stage C: -0.077/step

Stage A's per-step rate is higher because synthetic data has zero variance: every batch is the same identity-mapping task. Real-corpus Stage C has higher variance but covers more concepts, so absolute delta is larger.

Phase 4 ladder progress:
  Stage A (#1833)                ✅ MERGED + verified
  Stage B-1 (#1836)              ✅ MERGED
  Stage B-2 (#1839)              ✅ MERGED
  Stage C-prep (#1840)           ✅ MERGED
  Stage B-1.5 tests (#1841)      🟡 in CI
  Stage C trial (THIS evidence)  ✅ PASSED 2026-05-20
  Stage D 50K dispatch           ⏳ awaiting user check-in (28h GB10 compute)
  Stage E HumanEval pass@1       ⏳ Phase 5 (turnkey post-Stage-D)
  Stage F publish v2             ⏳ Phase 6 (turnkey post-Stage-E)

Evidence:
- evidence/distill-stage-c-trial/dispatch.json: dispatch manifest
- evidence/distill-stage-c-trial/launch-victory.txt: full training log

Run dir on gx10: /home/noah/runs/distill-smoke-20260520-123259/
Trained checkpoint: student-trained.apr/model.safetensors

Co-authored-by: Claude Opus 4.7 <noreply@anthropic.com>
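
The "cascade math" in the Stage C commit message is plain division, and it can be reproduced from the reported numbers. This is a small sketch using only figures quoted in that message; the helper name `per_step` is made up for illustration.

```python
def per_step(delta_loss, steps):
    """Average loss change per optimizer step."""
    return delta_loss / steps


# Figures as reported in the commit message.
stage_a_rate = per_step(-6.80, 62)    # synthetic, seq=256
stage_c_rate = per_step(-9.60, 124)   # real corpus, seq=256

# Stage C relative reduction and throughput from the raw losses and timings.
initial_loss, final_loss = 15.6094, 6.0095
relative_change = (final_loss - initial_loss) / initial_loss
sec_per_step = 232.4 / 124

print(f"Stage A: {stage_a_rate:.3f}/step")
print(f"Stage C: {stage_c_rate:.3f}/step")
print(f"Stage C relative change: {relative_change * 100:.1f}%")
print(f"Stage C throughput: {sec_per_step:.2f} sec/step")
```

This gives -0.110/step for Stage A and -0.077/step for Stage C, matching the message. The relative change comes out at about -61.5%, which the message rounds to -62%.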

… 4 RUNNING (#1851)

Captures the live state of the distillation epic as of 2026-05-20:

  Phase 1:  Teacher provider              ✅ MERGED (#1786, #1787)
  Phase 2:  Student fwd/bwd + KD          ✅ MERGED (#1788–#1797)
  Phase 3:  E2E smoke on Blackwell GB10   ✅ DISCHARGED (#1828)
  Phase 3b: seq_len=256 scale verify      ✅ DISCHARGED (#1833)
  Phase 4:  50K training (Stage D)        🟡 RUNNING (PID 196378, gx10)
  Phase 5:  HumanEval pass@1              ⏳ ready (#1847)
  Phase 6:  Publish v2                    ⏳ ready (#1848)

Inserts a new top-of-doc status table that points at:
- The 11-PR Blackwell cascade (post-mortem in blackwell-cascade-postmortem.md)
- Stage C real-corpus dispatch result (15.61 → 6.01 over 124 steps)
- Stage D running with ETA ~22h from 2026-05-20 13:43 UTC
- Phase 5/6 turnkey scripts ready post-D

This captures institutional knowledge for the team and future sessions: the spec doc reflects what's actually shipped rather than the original plan from 2026-05-18 when the epic was still scaffolded.

Co-authored-by: Claude Opus 4.7 <noreply@anthropic.com>
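
For planning purposes, the quoted Stage D ETA can be turned into a wall-clock time, and a rough linear extrapolation can be made from the Stage C throughput. The extrapolation assumes Stage D runs 50,000 steps at the Stage C rate of 1.87 sec/step. That is only an assumption, since the Stage D step time is not given here, so the two estimates need not agree.

```python
from datetime import datetime, timedelta, timezone

# Quoted in the status update: ETA ~22h from 2026-05-20 13:43 UTC.
start = datetime(2026, 5, 20, 13, 43, tzinfo=timezone.utc)
eta = start + timedelta(hours=22)

# Rough cross-check (assumption): 50K steps at the Stage C rate of 1.87 sec/step.
extrapolated_hours = 50_000 * 1.87 / 3600

print("quoted ETA lands at:", eta.strftime("%Y-%m-%d %H:%M UTC"))
print(f"linear extrapolation: {extrapolated_hours:.1f} h")
```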

noahgift enabled auto-merge (squash) May 20, 2026 05:09

noahgift added 4 commits May 20, 2026 08:19

Merge branch 'main' into evidence/distill-phase-3-victory

b2134b8

Merge branch 'main' into evidence/distill-phase-3-victory

2389b5a

Merge branch 'main' into evidence/distill-phase-3-victory

cbcb00e

Merge branch 'main' into evidence/distill-phase-3-victory

f299b6b

noahgift merged commit 1898eb6 into main May 20, 2026
10 checks passed

noahgift deleted the evidence/distill-phase-3-victory branch May 20, 2026 09:05

noahgift mentioned this pull request May 20, 2026

docs(spec): SPEC-DISTILL-001 — Phases 1-3 CLOSED, Phase 4 RUNNING (2026-05-20) #1851

Merged

noahgift merged 5 commits into main from evidence/distill-phase-3-victory
